{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 更加透明的方式\n",
    "\n",
    "这里我们不使用Trainer这个高级API，而是用pytorch来实现。\n",
    "\n",
    "\n",
    "## 1. 数据集预处理\n",
    "在Huggingface官方教程里提到，在使用pytorch的dataloader之前，我们需要做一些事情：\n",
    "- 把dataset中一些不需要的列给去掉了，比如‘sentence1’，‘sentence2’等\n",
    "- 把数据转换成pytorch tensors\n",
    "- 修改列名 label 为 labels\n",
    "\n",
    "其他的都好说，但**为啥要修改列名 label 为 labels，好奇怪哦！**\n",
    "这里探究一下：\n",
    "\n",
    "\n",
    "首先，Huggingface的这些transformer Model直接call的时候，接受的标签这个参数是叫\"labels\"。\n",
    "所以不管你使用Trainer，还是原生pytorch去写，最终模型处理的时候，肯定是使用的名为\"labels\"的标签参数。\n",
    "\n",
    "\n",
    "但在Huggingface的datasets中，数据集的标签一般命名为\"label\"或者\"label_ids\"，那为什么在前两集中，我们没有对标签名进行处理呢？\n",
    "\n",
    "这一点在transformer的源码`trainer.py`里找到了端倪：\n",
    "```python\n",
    "# 位置在def _remove_unused_columns函数里\n",
    "# Labels may be named label or label_ids, the default data collator handles that.\n",
    "signature_columns += [\"label\", \"label_ids\"]\n",
    "```\n",
    "这里提示了， data collator 会负责处理标签问题。然后我又去查看了`data_collator.py`中发现了一下内容：\n",
    "```python\n",
    "class DataCollatorWithPadding:\n",
    "    ...\n",
    "    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:\n",
    "        ...\n",
    "        if \"label\" in batch:\n",
    "            batch[\"labels\"] = batch[\"label\"]\n",
    "            del batch[\"label\"]\n",
    "        if \"label_ids\" in batch:\n",
    "            batch[\"labels\"] = batch[\"label_ids\"]\n",
    "            del batch[\"label_ids\"]\n",
    "        return batch\n",
    "```\n",
    "这就真相大白了：不管数据集中提供的标签名叫\"label\"，还是\"label_ids\"，\n",
    "DataCollatorWithPadding 都会帮你转换成\"labels\"，装进batch里，再返回。\n",
    "\n",
    "前面使用Trainer的时候，DataCollatorWithPadding已经帮我们自动转换了，因此我们不需要操心这个问题。\n",
    "\n",
    "但这就是让我疑惑的地方：我们使用pytorch来写，其实也不用管这个，因为在pytorch的data_loader里面，有一个`collate_fn`参数，我们可以把DataCollatorWithPadding对象传进去，也会帮我们自动把\"label\"转换成\"labels\"。因此实际上，这应该是教程中的一个错误，我们不需要手动设计。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Reusing dataset glue (C:\\Users\\Administrator\\.cache\\huggingface\\datasets\\glue\\mrpc\\1.0.0\\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "b8102a966021470aa4688946db23983f",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/3 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Loading cached processed dataset at C:\\Users\\Administrator\\.cache\\huggingface\\datasets\\glue\\mrpc\\1.0.0\\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\\cache-f34d74a51064f292.arrow\n",
      "Loading cached processed dataset at C:\\Users\\Administrator\\.cache\\huggingface\\datasets\\glue\\mrpc\\1.0.0\\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\\cache-8114cae97162778f.arrow\n",
      "Loading cached processed dataset at C:\\Users\\Administrator\\.cache\\huggingface\\datasets\\glue\\mrpc\\1.0.0\\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\\cache-4b384cc92726f5c6.arrow\n"
     ]
    }
   ],
   "source": [
    "from datasets import load_dataset\n",
    "from transformers import AutoTokenizer, DataCollatorWithPadding\n",
    "\n",
    "raw_datasets = load_dataset(\"glue\", \"mrpc\")\n",
    "checkpoint = \"bert-base-uncased\"\n",
    "tokenizer = AutoTokenizer.from_pretrained(checkpoint)\n",
    "\n",
    "def tokenize_function(example):\n",
    "    return tokenizer(example[\"sentence1\"], example[\"sentence2\"], truncation=True)\n",
    "\n",
    "tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)\n",
    "data_collator = DataCollatorWithPadding(tokenizer=tokenizer)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids']\n"
     ]
    }
   ],
   "source": [
    "print(tokenized_datasets['train'].column_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "huggingface datasets贴心地准备了三个方法：`remove_columns`, `rename_column`, `set_format`\n",
    "\n",
    "来方便我们为pytorch的dataloader做准备："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['attention_mask', 'input_ids', 'label', 'token_type_ids']\n"
     ]
    }
   ],
   "source": [
    "tokenized_datasets = tokenized_datasets.remove_columns(['sentence1', 'sentence2','idx'])\n",
    "# tokenized_datasets = tokenized_datasets.rename_column('label','labels')\n",
    "tokenized_datasets.set_format('torch')\n",
    "\n",
    "print(tokenized_datasets['train'].column_names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dataset({\n",
       "    features: ['attention_mask', 'input_ids', 'label', 'token_type_ids'],\n",
       "    num_rows: 3668\n",
       "})"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenized_datasets['train']  # 经过上面的处理，它就可以直接丢进pytorch的Dataloader中了，跟pytorch中的Dataset格式已经一样了"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "定义我们的pytorch dataloaders：\n",
    "\n",
    "在pytorch的DataLoader里，有一个`collate_fn`参数，其定义是：\"merges a list of samples to form a mini-batch of Tensor(s).  Used when using batched loading from a map-style dataset.\" 我们可以直接把Huggingface的DataCollatorWithPadding对象传进去，用于对数据进行padding等一系列处理："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "from torch.utils.data import DataLoader, Dataset\n",
    "train_dataloader = DataLoader(tokenized_datasets['train'], shuffle=True, batch_size=8, collate_fn=data_collator)  # 通过这里的dataloader，每个batch的seq_len可能不同\n",
    "eval_dataloader = DataLoader(tokenized_datasets['validation'], batch_size=8, collate_fn=data_collator)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'attention_mask': torch.Size([8, 72]),\n",
       " 'input_ids': torch.Size([8, 72]),\n",
       " 'token_type_ids': torch.Size([8, 72]),\n",
       " 'labels': torch.Size([8])}"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 查看一下train_dataloader的元素长啥样\n",
    "for batch in train_dataloader:\n",
    "    break\n",
    "{k: v.shape for k, v in batch.items()}\n",
    "# 可见都是长度为72，size=8的batch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. 模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']\n",
      "- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
      "- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
      "Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']\n",
      "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
     ]
    }
   ],
   "source": [
    "from transformers import AutoModelForSequenceClassification\n",
    "\n",
    "model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "SequenceClassifierOutput(loss=tensor(0.7563, grad_fn=<NllLossBackward>), logits=tensor([[-0.2171, -0.4416],\n",
       "        [-0.2248, -0.4694],\n",
       "        [-0.2440, -0.4664],\n",
       "        [-0.2421, -0.4510],\n",
       "        [-0.2273, -0.4545],\n",
       "        [-0.2339, -0.4515],\n",
       "        [-0.2334, -0.4387],\n",
       "        [-0.2362, -0.4601]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model(**batch)  # 这样的batch可以直接丢进模型处理"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## optimizer 和 learning rate scheduler"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1377\n"
     ]
    }
   ],
   "source": [
    "from transformers import AdamW, get_scheduler\n",
    "\n",
    "optimizer = AdamW(model.parameters(), lr=5e-5)\n",
    "\n",
    "num_epochs = 3\n",
    "num_training_steps = num_epochs * len(train_dataloader)  # num of batches * num of epochs\n",
    "lr_scheduler = get_scheduler(\n",
    "    'linear',\n",
    "    optimizer=optimizer,  # scheduler是针对optimizer的lr的\n",
    "    num_warmup_steps=0,\n",
    "    num_training_steps=num_training_steps)\n",
    "print(num_training_steps)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "device(type='cuda')"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import torch\n",
    "\n",
    "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
    "model.to(device)\n",
    "\n",
    "device"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## training loops:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 459/459 [01:54<00:00,  4.01it/s]\n",
      "100%|██████████| 459/459 [01:55<00:00,  3.98it/s]\n",
      "100%|██████████| 459/459 [01:55<00:00,  3.96it/s]\n"
     ]
    }
   ],
   "source": [
    "from tqdm import tqdm\n",
    "\n",
    "for epoch in range(num_epochs):\n",
    "    for batch in tqdm(train_dataloader):\n",
    "        # 要在GPU上训练，需要把数据集都移动到GPU上：\n",
    "        batch = {k:v.to(device) for k,v in batch.items()}\n",
    "        loss = model(**batch).loss\n",
    "        loss.backward()\n",
    "        optimizer.step()\n",
    "        lr_scheduler.step()\n",
    "        optimizer.zero_grad()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'accuracy': 0.8651960784313726, 'f1': 0.9050086355785838}"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from datasets import load_metric\n",
    "\n",
    "metric= load_metric(\"glue\", \"mrpc\")\n",
    "model.eval()\n",
    "for batch in eval_dataloader:\n",
    "    batch = {k: v.to(device) for k, v in batch.items()}\n",
    "    with torch.no_grad():  # evaluation的时候不需要算梯度\n",
    "        outputs = model(**batch)\n",
    "    \n",
    "    logits = outputs.logits\n",
    "    predictions = torch.argmax(logits, dim=-1)\n",
    "    metric.add_batch(predictions=predictions, references=batch[\"labels\"])\n",
    "\n",
    "metric.compute()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. 使用 Accelerate 库进一步加速\n",
    "The training loop we defined earlier works fine on a single CPU or GPU. But using the 🤗 Accelerate library, with just a few adjustments we can enable distributed training on multiple GPUs or TPUs.\n",
    "\n",
    "日后再说吧~"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
